Task 2

Mencía Gómez Luna - id document: 05468738-M

Instructions (read before starting)

  • In the .qmd document, edit your personal data (name and ID) located in the header of the file. Do not touch anything else in the header (note that I have included embed-resources: true so that everything is contained in a single html without extra files, and theme: [style.scss] to give the delivery a nice style via the style.scss file in the folder).

  • Make sure, BEFORE editing the document any further, that the .qmd file renders correctly and the corresponding .html is generated in your local folder. The chunks (code boxes) provided are either empty or incomplete, hence most of them have the #| eval: false option. Once you have edited what you consider, change each chunk to #| eval: true (or remove the option) so they run.

  • Remember that you can run chunk by chunk with the play button or run all chunks up to a given chunk (with the button to the left of the previous one)

Case study: analysis of covid data

Required packages

Add in the chunk below all the packages you need

rm(list = ls()) # Remove old variables

library(tidyverse)      # loads dplyr, tidyr, stringr, ggplot2...
library(readxl)
library(microbenchmark)
library(lubridate)
library(rvest)
library(datapasta)
library(zoo)

Question 1

In the project folder you have the dataset messy_covid_data.xlsx. Take a look at the {readxl} package and load the file correctly.

Question 1

data <- read_excel("messy_covid_data.xlsx")
print(data)
# A tibble: 43,301 × 32
   provincia_iso fecha               `0-9_H` `10-19_H` `20-29_H` `30-39_H`
   <chr>         <dttm>                <dbl>     <dbl>     <dbl>     <dbl>
 1 A             2020-01-01 00:00:00       0         0         0         0
 2 AB            2020-01-01 00:00:00       0         0         0         0
 3 AL            2020-01-01 00:00:00       0         0         0         0
 4 AV            2020-01-01 00:00:00       0         0         0         0
 5 B             2020-01-01 00:00:00       0         0         0         0
 6 BA            2020-01-01 00:00:00       0         0         0         0
 7 BI            2020-01-01 00:00:00       0         0         0         0
 8 BU            2020-01-01 00:00:00       0         0         0         0
 9 C             2020-01-01 00:00:00       0         0         0         0
10 CA            2020-01-01 00:00:00       0         0         0         0
# ℹ 43,291 more rows
# ℹ 26 more variables: `40-49_H` <dbl>, `50-59_H` <dbl>, `60-69_H` <dbl>,
#   `70-79_H` <dbl>, `80+_H` <dbl>, NC_H <dbl>, `0-9_M` <dbl>, `10-19_M` <dbl>,
#   `20-29_M` <dbl>, `30-39_M` <dbl>, `40-49_M` <dbl>, `50-59_M` <dbl>,
#   `60-69_M` <dbl>, `70-79_M` <dbl>, `80+_M` <dbl>, NC_M <dbl>,
#   `0-9_NC` <dbl>, `10-19_NC` <dbl>, `20-29_NC` <dbl>, `30-39_NC` <dbl>,
#   `40-49_NC` <dbl>, `50-59_NC` <dbl>, `60-69_NC` <dbl>, `70-79_NC` <dbl>, …

Column names encode sex (H male, M female, NC missing) and age group (0-9, 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, ≥80 years, and NC missing).

Question 2

Design a loop that goes through all rows and, except for the first two columns, converts each 0 found in the row to an NA.

My computer cannot render this loop (it keeps rendering for hours), so I leave it commented out; it would look like this:

# for (i in 1:nrow(data)) {
#   data[i, -(1:2)] <- lapply(data[i, -(1:2)], function(x) ifelse(x == 0, NA, x))
# }
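As a hedged aside: the slowness comes from the row-by-row loop, not from the replacement itself. A vectorized base-R sketch (shown on a small toy data frame with made-up values, not the real dataset) does the same replacement in a single pass:

```r
# Toy stand-in for the covid table (hypothetical values)
toy <- data.frame(provincia_iso = c("A", "AB"),
                  fecha = c("2020-01-01", "2020-01-01"),
                  h0_9 = c(0, 3),
                  h10_19 = c(5, 0))

# Index every cell beyond the first two columns at once: no loop needed
toy[-(1:2)][toy[-(1:2)] == 0] <- NA
```

On the full table this runs in milliseconds, because the comparison and assignment happen column-wise in compiled code rather than row by row in R.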

Question 3

Perform the above action with the original table but in tidyverse mode: loops, brackets and dollars are forbidden. Make it as generally applicable as possible.

data <- data |>
  mutate(across(-c(1, 2), ~ ifelse(.x == 0, NA, .x)))

print(data)
# A tibble: 43,301 × 32
   provincia_iso fecha               `0-9_H` `10-19_H` `20-29_H` `30-39_H`
   <chr>         <dttm>                <dbl>     <dbl>     <dbl>     <dbl>
 1 A             2020-01-01 00:00:00      NA        NA        NA        NA
 2 AB            2020-01-01 00:00:00      NA        NA        NA        NA
 3 AL            2020-01-01 00:00:00      NA        NA        NA        NA
 4 AV            2020-01-01 00:00:00      NA        NA        NA        NA
 5 B             2020-01-01 00:00:00      NA        NA        NA        NA
 6 BA            2020-01-01 00:00:00      NA        NA        NA        NA
 7 BI            2020-01-01 00:00:00      NA        NA        NA        NA
 8 BU            2020-01-01 00:00:00      NA        NA        NA        NA
 9 C             2020-01-01 00:00:00      NA        NA        NA        NA
10 CA            2020-01-01 00:00:00      NA        NA        NA        NA
# ℹ 43,291 more rows
# ℹ 26 more variables: `40-49_H` <dbl>, `50-59_H` <dbl>, `60-69_H` <dbl>,
#   `70-79_H` <dbl>, `80+_H` <dbl>, NC_H <dbl>, `0-9_M` <dbl>, `10-19_M` <dbl>,
#   `20-29_M` <dbl>, `30-39_M` <dbl>, `40-49_M` <dbl>, `50-59_M` <dbl>,
#   `60-69_M` <dbl>, `70-79_M` <dbl>, `80+_M` <dbl>, NC_M <dbl>,
#   `0-9_NC` <dbl>, `10-19_NC` <dbl>, `20-29_NC` <dbl>, `30-39_NC` <dbl>,
#   `40-49_NC` <dbl>, `50-59_NC` <dbl>, `60-69_NC` <dbl>, `70-79_NC` <dbl>, …
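A slightly more general variant of the same idea, assuming every case column is numeric, selects the columns with where(is.numeric) and uses dplyr's na_if() helper; sketched here on a toy table with made-up values:

```r
library(dplyr)

# Toy table standing in for the covid data (hypothetical values)
toy <- data.frame(provincia_iso = c("A", "AB"),
                  casos_h = c(0, 3),
                  casos_m = c(5, 0))

toy_clean <- toy |>
  mutate(across(where(is.numeric), ~ na_if(.x, 0)))
```

With where(is.numeric) the code keeps working even if the id columns change position or number, which is what "as generally applicable as possible" is after.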

Question 4

Design a function to test both methods using the {microbenchmark} package in terms of time efficiency.

Question 4

I create a random sample of 1,000 rows to test both methods on, so the benchmark is easier and faster to run

set.seed(1234)
sample_data <- data |> slice_sample(n = 1000)

# Then I compare the two methods
method_loop <- function(data) {
  for (i in 1:nrow(data)) {
    data[i, -(1:2)] <- lapply(data[i, -(1:2)], function(x) ifelse(x == 0, NA, x))
  }
  return(data)
}

method_tidyverse <- function(data) {
  data |>
    mutate(across(-c(1, 2), ~ ifelse(.x == 0, NA, .x)))
}

benchmark_results <- microbenchmark(
  loop_method = method_loop(sample_data),
  tidyverse_method = method_tidyverse(sample_data),
  times = 30 # number of runs per method; more would be more precise, but 30 is reasonable and fast
)

Question 4

print(benchmark_results)
Unit: milliseconds
             expr      min       lq       mean   median       uq      max neval
      loop_method 468.5942 476.9987 517.716480 482.7577 535.5714 764.3472    30
 tidyverse_method   5.4045   5.7116   6.101987   5.8884   6.1217  10.9894    30

ANSWER: Comparing the two methods on a random sample of 1,000 rows, the loop method takes much longer than the tidyverse method: on average ~518 milliseconds for the loop versus ~6 milliseconds for tidyverse (the exact values vary a little each time the document is rendered).

Question 5

Reason why the data is not tidy data and convert it to tidy data, deleting rows as appropriate.

tidy_covid <- data |>
  pivot_longer(
    cols = -c(provincia_iso, fecha),
    names_to = "age_sexo", 
    values_to = "cases"
  ) |>
  separate(
    age_sexo,
    into = c("age_group", "sex"), 
    sep = "_",
    remove = FALSE
  ) |>
  select(-age_sexo) 

print(tidy_covid, width = Inf) # to show all columns
# A tibble: 1,299,030 × 5
   provincia_iso fecha               age_group sex   cases
   <chr>         <dttm>              <chr>     <chr> <dbl>
 1 A             2020-01-01 00:00:00 0-9       H        NA
 2 A             2020-01-01 00:00:00 10-19     H        NA
 3 A             2020-01-01 00:00:00 20-29     H        NA
 4 A             2020-01-01 00:00:00 30-39     H        NA
 5 A             2020-01-01 00:00:00 40-49     H        NA
 6 A             2020-01-01 00:00:00 50-59     H        NA
 7 A             2020-01-01 00:00:00 60-69     H        NA
 8 A             2020-01-01 00:00:00 70-79     H        NA
 9 A             2020-01-01 00:00:00 80+       H        NA
10 A             2020-01-01 00:00:00 NC        H        NA
# ℹ 1,299,020 more rows

Question 5

ANSWER: The original data is not tidy for several reasons. First, each variable is not a single column: age is not one variable but many (0-9_H, 10-19_H, etc.). Second, sex is mixed with age in the column names, so there is no separate column for "age" and another for "sex". Finally, each observation is not a single row: each combination of province, date, age group and sex should be its own row; otherwise values for specific combinations ("for province X, age Y and sex Z") cannot be represented.

Question 6

One of the columns we have coded sometimes as thing_thing, sometimes as thing+_thing, sometimes as NC_thing. Try to separate that column to generate three new columns age_inf, age_up and sex properly. For example, if I have 10-19_H I will have to send the 10 to one column, the 19 to another and H to another; if I have 80+_H I will have to send 80 to one, NA to another (there is no upper bound) and H to another; if I have NC_H you will have to have NA, NA and H.
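The three naming patterns can be checked in isolation with plain regular expressions before committing to a full pipeline. This base-R sketch works only on the three example strings from the statement; extract_num is a helper defined here, not part of any package:

```r
codes <- c("10-19_H", "80+_H", "NC_H")

# Helper: first match of `pattern` as a number, NA where there is no match
extract_num <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_real_, length(x))
  out[m > 0] <- as.numeric(regmatches(x, m))
  out
}

age_inf <- extract_num(codes, "^[0-9]+")      # 10, 80, NA
age_up  <- extract_num(codes, "(?<=-)[0-9]+") # 19, NA, NA (no upper bound for "80+")
sex     <- sub("^.*_", "", codes)             # "H", "H", "H"
```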

Question 6

tidy_covid <- data |>
  pivot_longer(
    cols = -c(provincia_iso, fecha), # Keep these columns
    names_to = "age_sex", 
    values_to = "cases"  
  ) |>
  mutate(
    age_inf = str_extract(age_sex, "^[0-9]+"), # we create age_inf  
    age_up = str_extract(age_sex, "(?<=-)[0-9]+"), #we create age_up
    sex = str_extract(age_sex, "(?<=_)[A-Za-z]+"), #we create sex
    age_inf = as.numeric(age_inf),
    age_up = as.numeric(age_up)  
  ) |>
  #special cases like "80+" and "NC"
  mutate(
    age_up = ifelse(str_detect(age_sex, "\\+"), NA, age_up), #NA if the format includes "+"
    age_inf = ifelse(str_detect(age_sex, "^NC"), NA, age_inf), #NA if it's "NC"
    age_up = ifelse(str_detect(age_sex, "^NC"), NA, age_up),    #NA if it's "NC"
    sex = ifelse(sex == "NC", NA, sex) #NA if it's "NC"
  )

print(tidy_covid, width = Inf)
# A tibble: 1,299,030 × 7
   provincia_iso fecha               age_sex cases age_inf age_up sex  
   <chr>         <dttm>              <chr>   <dbl>   <dbl>  <dbl> <chr>
 1 A             2020-01-01 00:00:00 0-9_H      NA       0      9 H    
 2 A             2020-01-01 00:00:00 10-19_H    NA      10     19 H    
 3 A             2020-01-01 00:00:00 20-29_H    NA      20     29 H    
 4 A             2020-01-01 00:00:00 30-39_H    NA      30     39 H    
 5 A             2020-01-01 00:00:00 40-49_H    NA      40     49 H    
 6 A             2020-01-01 00:00:00 50-59_H    NA      50     59 H    
 7 A             2020-01-01 00:00:00 60-69_H    NA      60     69 H    
 8 A             2020-01-01 00:00:00 70-79_H    NA      70     79 H    
 9 A             2020-01-01 00:00:00 80+_H      NA      80     NA H    
10 A             2020-01-01 00:00:00 NC_H       NA      NA     NA H    
# ℹ 1,299,020 more rows

Question 6

ANSWER: With this code we first pivot the table into one column age_sex and one column cases, so we do not keep dozens of unnecessary columns. Secondly, we split the newly created age_sex column into three columns (age_inf, age_up and sex), making sure the ages are stored as numeric values. Finally, special cases such as "80+" or "NC" are set to missing (NA). I keep the variable age_sex because we will need it in a later exercise to extract the age group.

Question 7

Add a new variable month_year to the table that encodes the month and year (for example, any day in January 2020 will be something similar to “1-2020” and any day in February 2021 will be “2-2021”).
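A base-R equivalent of the lubridate approach, assuming fecha is a date, pulls month and year with format() and strips the leading zero by converting to integer:

```r
d <- as.Date(c("2020-01-15", "2021-02-03"))

# as.integer() turns "01" into 1, giving "1-2020" rather than "01-2020"
month_year <- paste0(as.integer(format(d, "%m")), "-", format(d, "%Y"))
month_year
#> [1] "1-2020" "2-2021"
```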

Question 7

tidy_covid <- tidy_covid |>
  mutate(
    month_year = paste0(month(fecha), "-", year(fecha)))

print(tidy_covid, width = Inf)
# A tibble: 1,299,030 × 8
   provincia_iso fecha               age_sex cases age_inf age_up sex  
   <chr>         <dttm>              <chr>   <dbl>   <dbl>  <dbl> <chr>
 1 A             2020-01-01 00:00:00 0-9_H      NA       0      9 H    
 2 A             2020-01-01 00:00:00 10-19_H    NA      10     19 H    
 3 A             2020-01-01 00:00:00 20-29_H    NA      20     29 H    
 4 A             2020-01-01 00:00:00 30-39_H    NA      30     39 H    
 5 A             2020-01-01 00:00:00 40-49_H    NA      40     49 H    
 6 A             2020-01-01 00:00:00 50-59_H    NA      50     59 H    
 7 A             2020-01-01 00:00:00 60-69_H    NA      60     69 H    
 8 A             2020-01-01 00:00:00 70-79_H    NA      70     79 H    
 9 A             2020-01-01 00:00:00 80+_H      NA      80     NA H    
10 A             2020-01-01 00:00:00 NC_H       NA      NA     NA H    
   month_year
   <chr>     
 1 1-2020    
 2 1-2020    
 3 1-2020    
 4 1-2020    
 5 1-2020    
 6 1-2020    
 7 1-2020    
 8 1-2020    
 9 1-2020    
10 1-2020    
# ℹ 1,299,020 more rows

Question 7

Import from wikipedia (using code) https://es.wikipedia.org/wiki/ISO_3166-2:ES#Provincias the table containing the ISO codes for each province.

url <- "https://es.wikipedia.org/wiki/ISO_3166-2:ES#Provincias"
prop_ISO <- read_html(url)
prop_ISO <- html_table(prop_ISO, fill=TRUE)
prop_ISO <- as_tibble(prop_ISO[[2]])  

I had problems in later exercises with the variable "Nombre de la subdivisión en la ISO[1]", so I clean the column names and rename it

colnames(prop_ISO) <- gsub(" ", "_", colnames(prop_ISO))
colnames(prop_ISO) <- gsub("[[:punct:]]", "", colnames(prop_ISO))

prop_ISO <- prop_ISO |>
  rename(provincia = NombredelasubdivisiónenlaISO1,
         CCAA = Comunidadautónoma)

print(prop_ISO)
# A tibble: 50 × 3
   Código        provincia                     CCAA 
   <chr>         <chr>                         <chr>
 1 ES-C          A Coruña (La Coruña)[nota 1]​  GA   
 2 ES-VI[nota 3]​ Araba/Álava[nota 4]​           PV   
 3 ES-AB         Albacete                      CM   
 4 ES-A          Alacant/Alicante[nota 2]​      VC   
 5 ES-AL         Almería                       AN   
 6 ES-O[nota 3]​  Asturias                      AS   
 7 ES-AV         Ávila                         CL   
 8 ES-BA         Badajoz                       EX   
 9 ES-PM[nota 3]​ Balears (Baleares)[nota 1]​    IB   
10 ES-B          Barcelona (Barcelona)[nota 1]​ CT   
# ℹ 40 more rows

Question 7

Some provinces and codes have "[nota x]" appended, so we clean them to be able to do a clean join in the next exercise

prop_ISO <- prop_ISO |>
  mutate(
    provincia = gsub("\\[nota \\d+\\]​?", "", provincia),  
    provincia = gsub("\\s?\\(.*?\\)", "", provincia),   
    provincia = trimws(provincia),
    Código = gsub("\\[nota \\d+\\]​?", "", Código),  
    Código = gsub("\\s?\\(.*?\\)", "", Código),   
    Código = trimws(Código),
    Código = gsub("ES-", "", Código))

print(prop_ISO)
# A tibble: 50 × 3
   Código provincia        CCAA 
   <chr>  <chr>            <chr>
 1 C      A Coruña         GA   
 2 VI     Araba/Álava      PV   
 3 AB     Albacete         CM   
 4 A      Alacant/Alicante VC   
 5 AL     Almería          AN   
 6 O      Asturias         AS   
 7 AV     Ávila            CL   
 8 BA     Badajoz          EX   
 9 PM     Balears          IB   
10 B      Barcelona        CT   
# ℹ 40 more rows
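The same cleaning can be sanity-checked on a few raw strings; this base-R sketch mirrors the gsub() calls above on simplified examples (the real Wikipedia strings also carry invisible zero-width spaces, which the chunk above additionally strips):

```r
raw <- c("ES-VI[nota 3]", "A Coruña (La Coruña)[nota 1]", "ES-AB")

clean <- gsub("\\[nota \\d+\\]", "", raw)  # drop "[nota x]" footnote marks
clean <- gsub("\\s?\\(.*?\\)", "", clean)  # drop parenthesised alternative names
clean <- sub("^ES-", "", trimws(clean))    # drop the country prefix from the codes
```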

Question 8

Preprocess as you consider the previous table to be able to join that table to the tidy_covid table.

Question 8

Before joining the databases we have to check if the values of Código/provincia_iso match

tidy_covid_vals <- unique(tidy_covid$provincia_iso)
prop_ISO_vals <- unique(prop_ISO$Código)

# Values of tidy_covid that aren't in prop_ISO: CE, ML, NC, missings (NA)
tidy_covid_vals[!tidy_covid_vals %in% prop_ISO_vals]
[1] "CE" "ML" NA   "NC"
#Values of prop_ISO that aren't in tidy_covid: NA
prop_ISO_vals[!prop_ISO_vals %in% tidy_covid_vals] #NA
[1] "NA"

Question 8

We do a left join including Ceuta, Melilla and NC

# Join (we add CE, ML and NC):
final_data <- tidy_covid |>
  left_join(prop_ISO, by = c("provincia_iso" = "Código")) |>
  mutate(
    provincia = case_when(
      provincia_iso == "CE" ~ "Ceuta",
      provincia_iso == "ML" ~ "Melilla",
      provincia_iso == "NC" ~ "NC",
      TRUE ~ provincia
    ),
    CCAA = case_when(
      provincia_iso == "CE" ~ "CE",
      provincia_iso == "ML" ~ "ML",
      provincia_iso == "NC" ~ "NC",
      TRUE ~ CCAA
    )) 

Question 8

ANSWER: Before joining the tables I checked that the province codes coincide, and where they did not I added the corresponding values (specifically for Ceuta, Melilla and NC - No contesta). I could also have done it with a full_join, but then a new observation about Navarra appeared at the end that would have to be eliminated; I did not use an inner_join because it would have lost the data for Ceuta, Melilla and NC. In exercise 11 we will be asked to add the values for Navarra to fully complete the database.
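The join behaviour described here can be illustrated on toy tables with base R's merge(), where all.x = TRUE plays the role of left_join() (toy values and simplified column names, hypothetical):

```r
covid <- data.frame(provincia_iso = c("A", "CE"), cases = c(10, 2))
iso   <- data.frame(Codigo = c("A", "NA"),
                    provincia = c("Alacant/Alicante", "Navarra"))

# all.x = TRUE keeps every covid row even without a match (like left_join)
joined <- merge(covid, iso, by.x = "provincia_iso", by.y = "Codigo", all.x = TRUE)
# "CE" has no match in the ISO table, so its provincia comes out NA
# and has to be filled in by hand afterwards
```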

print(final_data, width = Inf)
# A tibble: 1,299,030 × 10
   provincia_iso fecha               age_sex cases age_inf age_up sex  
   <chr>         <dttm>              <chr>   <dbl>   <dbl>  <dbl> <chr>
 1 A             2020-01-01 00:00:00 0-9_H      NA       0      9 H    
 2 A             2020-01-01 00:00:00 10-19_H    NA      10     19 H    
 3 A             2020-01-01 00:00:00 20-29_H    NA      20     29 H    
 4 A             2020-01-01 00:00:00 30-39_H    NA      30     39 H    
 5 A             2020-01-01 00:00:00 40-49_H    NA      40     49 H    
 6 A             2020-01-01 00:00:00 50-59_H    NA      50     59 H    
 7 A             2020-01-01 00:00:00 60-69_H    NA      60     69 H    
 8 A             2020-01-01 00:00:00 70-79_H    NA      70     79 H    
 9 A             2020-01-01 00:00:00 80+_H      NA      80     NA H    
10 A             2020-01-01 00:00:00 NC_H       NA      NA     NA H    
   month_year provincia        CCAA 
   <chr>      <chr>            <chr>
 1 1-2020     Alacant/Alicante VC   
 2 1-2020     Alacant/Alicante VC   
 3 1-2020     Alacant/Alicante VC   
 4 1-2020     Alacant/Alicante VC   
 5 1-2020     Alacant/Alicante VC   
 6 1-2020     Alacant/Alicante VC   
 7 1-2020     Alacant/Alicante VC   
 8 1-2020     Alacant/Alicante VC   
 9 1-2020     Alacant/Alicante VC   
10 1-2020     Alacant/Alicante VC   
# ℹ 1,299,020 more rows

Question 9

Using the previously created group variable month_year, obtain a summary that returns in a tibble, for each province and each month-year, the daily average cases (regardless of age and sex).

Question 9

summary_cases <- final_data |>
  group_by(provincia_iso, month_year) |>
  summarise(
    total_cases = sum(cases, na.rm = TRUE),
    num_days = n_distinct(fecha),
    daily_cases = total_cases / num_days,
    .groups = "drop"  # so the grouping does not persist after summarising
  )

print(summary_cases, width = Inf)
# A tibble: 1,431 × 5
   provincia_iso month_year total_cases num_days daily_cases
   <chr>         <chr>            <dbl>    <int>       <dbl>
 1 A             1-2020               0       31         0  
 2 A             1-2021           74835       31      2414. 
 3 A             1-2022          172951       31      5579. 
 4 A             10-2020          10364       31       334. 
 5 A             10-2021           2081       31        67.1
 6 A             11-2020          14139       30       471. 
 7 A             11-2021           6108       30       204. 
 8 A             12-2020          18136       31       585. 
 9 A             12-2021          49708       31      1603. 
10 A             2-2020               0       29         0  
# ℹ 1,421 more rows
class(summary_cases)
[1] "tbl_df"     "tbl"        "data.frame"

Question 9

ANSWER: First we group by province and month-year so we can do calculations within those groups. Once grouped, we calculate the average number of cases per day (total cases / number of days). We use .groups = "drop" so that the tibble returns to its original ungrouped state.

Question 10

Design a function summ_by_dates_prov() that, given a vector of ISO province codes and a table with a province id column, a date column (we don't care about month-year because we will create it inside the function) and a cases column, returns the average daily cases in each month-year (regardless of sex or age) for the provided provinces. Then apply the function to the ISO codes of the prov_allowed vector below and check that it gives you the same as before (it should…)

Question 10

prov_allowed <- c("M", "B", "V", "SE", "Z", "MA")

summ_by_dates_prov <- function(prov_codes, data_table) {
  data_table |> 
    filter(provincia_iso %in% prov_codes) |> # keep only the requested provinces
    mutate(month_year = format(fecha, "%m-%Y")) |>
    group_by(provincia_iso, month_year) |>
    summarise(
      # total cases and number of unique days per group
      total_cases = sum(cases, na.rm = TRUE),
      num_days = n_distinct(fecha),
      daily_avg_cases = total_cases / num_days,
      .groups = "drop" 
    )
} 

result <- summ_by_dates_prov(prov_allowed, final_data)

print(result, width = Inf)
# A tibble: 162 × 5
   provincia_iso month_year total_cases num_days daily_avg_cases
   <chr>         <chr>            <dbl>    <int>           <dbl>
 1 B             01-2020              1       31          0.0323
 2 B             01-2021          74662       31       2408.    
 3 B             01-2022         633958       31      20450.    
 4 B             02-2020              6       29          0.207 
 5 B             02-2021          35824       28       1279.    
 6 B             02-2022         127595       28       4557.    
 7 B             03-2020          19365       31        625.    
 8 B             03-2021          29919       31        965.    
 9 B             03-2022          55939       27       2072.    
10 B             04-2020          19910       30        664.    
# ℹ 152 more rows

Question 10

ANSWER: We get the same results as before for these provinces (prov_allowed): checking the previous exercise's results for provincia_iso = "M", "B", "V", "SE", "Z", "MA", the averages coincide (note that here month_year is built with format(fecha, "%m-%Y"), so January appears as "01-2020" instead of "1-2020", but the values are the same). The function filters the data to a given set of provinces, calculates the daily average of cases for each month-year and province, and returns a summary with these averages. We then apply it to our database and the provinces we want.

Question 11

Run the code you consider to properly recode the province ISO codes (right now Navarra is as NA; look for what should be missing and fix it).

Question 11

final_data <- final_data |>
  mutate(
    provincia_iso = ifelse(is.na(provincia_iso), "NA", provincia_iso),
    provincia = ifelse(provincia_iso == "NA", "Navarra", provincia),
    CCAA = ifelse(provincia_iso == "NA", "NA", CCAA)
  )

print(final_data, width = Inf)
# A tibble: 1,299,030 × 10
   provincia_iso fecha               age_sex cases age_inf age_up sex  
   <chr>         <dttm>              <chr>   <dbl>   <dbl>  <dbl> <chr>
 1 A             2020-01-01 00:00:00 0-9_H      NA       0      9 H    
 2 A             2020-01-01 00:00:00 10-19_H    NA      10     19 H    
 3 A             2020-01-01 00:00:00 20-29_H    NA      20     29 H    
 4 A             2020-01-01 00:00:00 30-39_H    NA      30     39 H    
 5 A             2020-01-01 00:00:00 40-49_H    NA      40     49 H    
 6 A             2020-01-01 00:00:00 50-59_H    NA      50     59 H    
 7 A             2020-01-01 00:00:00 60-69_H    NA      60     69 H    
 8 A             2020-01-01 00:00:00 70-79_H    NA      70     79 H    
 9 A             2020-01-01 00:00:00 80+_H      NA      80     NA H    
10 A             2020-01-01 00:00:00 NC_H       NA      NA     NA H    
   month_year provincia        CCAA 
   <chr>      <chr>            <chr>
 1 1-2020     Alacant/Alicante VC   
 2 1-2020     Alacant/Alicante VC   
 3 1-2020     Alacant/Alicante VC   
 4 1-2020     Alacant/Alicante VC   
 5 1-2020     Alacant/Alicante VC   
 6 1-2020     Alacant/Alicante VC   
 7 1-2020     Alacant/Alicante VC   
 8 1-2020     Alacant/Alicante VC   
 9 1-2020     Alacant/Alicante VC   
10 1-2020     Alacant/Alicante VC   
# ℹ 1,299,020 more rows

Question 11

ANSWER: In addition to recoding the missing codes (NA) as the string "NA", we also added the corresponding values for Navarra in provincia and CCAA. The database now has all the provinces included.

Question 12

With the database generated in the previous exercise, calculate the proportion of cases with unknown sex. Do the same with unknown province. Eliminate such records if the number of cases represents less than 1% (for each).

Question 12

#Unknown sex and provinces
final_data |> 
  mutate(
    unknown_sex = is.na(sex),
    unknown_province = is.na(provincia_iso) | provincia_iso == "NC"
  ) |> 
  summarise(
    total_cases = sum(cases, na.rm = TRUE),
    unknown_sex_cases = sum(cases[unknown_sex], na.rm = TRUE),
    unknown_province_cases = sum(cases[unknown_province], na.rm = TRUE),
    prop_unknown_sex = unknown_sex_cases / total_cases,
    prop_unknown_province = unknown_province_cases / total_cases
  ) |> 
  print()
# A tibble: 1 × 5
  total_cases unknown_sex_cases unknown_province_cases prop_unknown_sex
        <dbl>             <dbl>                  <dbl>            <dbl>
1    11585186              9033                  39952         0.000780
# ℹ 1 more variable: prop_unknown_province <dbl>

Question 12

ANSWER: 0.078% of all the cases have unknown sex and 0.34% of the cases have unknown province.

Question 12

Since in both cases (sex and province) the number of cases represents less than 1% of the database, we simply eliminate those records

final_data <- final_data |> 
  filter(!is.na(sex) & !is.na(provincia_iso))

Question 13

Create a new variable called cum_cases containing the accumulated cases for each date, disaggregated by province, age group and sex.
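The grouped cumulative sum can be sanity-checked in base R with ave(), which applies a function within each group while keeping the original row order (toy values, hypothetical):

```r
cases <- c(1, 2, 3, 10, 20)          # hypothetical daily cases, already in date order
prov  <- c("A", "A", "A", "B", "B")  # one group per province

# cumsum restarts for each province: within A it gives 1, 3, 6; within B, 10, 30
cum_cases <- ave(cases, prov, FUN = cumsum)
```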

Question 13

# we create the variable "age_group" again from age_sex
final_data <- final_data |>
  separate(
    age_sex,
    into = c("age_group", "sexo"), 
    sep = "_",
    remove = FALSE) |>
  select(-sexo, -age_sex)

final_data <- final_data |>
  group_by(provincia_iso, age_group, sex) |>  # no fecha here: we accumulate ACROSS dates
  mutate(
    cum_cases = cumsum(replace_na(cases, 0))  # replace NA with 0 only inside the cumulative sum
  ) |>
  ungroup()

print(final_data, width = Inf)
# A tibble: 866,020 × 11
   provincia_iso fecha               age_group cases age_inf age_up sex  
   <chr>         <dttm>              <chr>     <dbl>   <dbl>  <dbl> <chr>
 1 A             2020-01-01 00:00:00 0-9          NA       0      9 H    
 2 A             2020-01-01 00:00:00 10-19        NA      10     19 H    
 3 A             2020-01-01 00:00:00 20-29        NA      20     29 H    
 4 A             2020-01-01 00:00:00 30-39        NA      30     39 H    
 5 A             2020-01-01 00:00:00 40-49        NA      40     49 H    
 6 A             2020-01-01 00:00:00 50-59        NA      50     59 H    
 7 A             2020-01-01 00:00:00 60-69        NA      60     69 H    
 8 A             2020-01-01 00:00:00 70-79        NA      70     79 H    
 9 A             2020-01-01 00:00:00 80+          NA      80     NA H    
10 A             2020-01-01 00:00:00 NC           NA      NA     NA H    
   month_year provincia        CCAA  cum_cases
   <chr>      <chr>            <chr>     <dbl>
 1 1-2020     Alacant/Alicante VC            0
 2 1-2020     Alacant/Alicante VC            0
 3 1-2020     Alacant/Alicante VC            0
 4 1-2020     Alacant/Alicante VC            0
 5 1-2020     Alacant/Alicante VC            0
 6 1-2020     Alacant/Alicante VC            0
 7 1-2020     Alacant/Alicante VC            0
 8 1-2020     Alacant/Alicante VC            0
 9 1-2020     Alacant/Alicante VC            0
10 1-2020     Alacant/Alicante VC            0
# ℹ 866,010 more rows

Question 13

ANSWER: After recreating the "age_group" variable, the code groups the records by province, age group and sex (note that fecha must not be part of the grouping, otherwise every group would contain a single row and nothing would accumulate). Within each group the rows are already ordered by date, so cumsum gives the accumulated cases up to each date; we substitute the NAs in cases with 0 so the sum is not broken. Finally, we ungroup the data so it does not remain grouped after the operation.

Question 14

What were the 7 provinces with the most cases throughout the pandemic? And the 5 provinces with the fewest deaths? And if we disaggregate by sex? (We don't have the number of deaths, so we only do the part of the exercise that we can do.)

Question 14

province_cases <- final_data |>
  group_by(provincia_iso) |>
  summarise(total_cases = sum(cases, na.rm = TRUE)) |>
  arrange(desc(total_cases))

province_cases |>
  head(7)
# A tibble: 7 × 2
  provincia_iso total_cases
  <chr>               <dbl>
1 B                 1756033
2 M                 1640978
3 V                  728118
4 A                  467584
5 MU                 395267
6 BI                 335126
7 Z                  296091

Question 14

ANSWER: The 7 provinces with the most cases are Barcelona, Madrid, Valencia, Alicante, Murcia, Bizkaia and Zaragoza. This makes sense because they are the provinces with the most inhabitants, so they also have the largest numbers of infected people.

Question 14

And if we disaggregate by sex:

province_sex_cases <- final_data |>
  group_by(provincia_iso, sex) |>
  summarise(total_cases = sum(cases, na.rm = TRUE)) |>
  arrange(desc(total_cases))

province_sex_cases |>
  head(7)
# A tibble: 7 × 3
# Groups:   provincia_iso [4]
  provincia_iso sex   total_cases
  <chr>         <chr>       <dbl>
1 B             M          932573
2 M             M          876472
3 B             H          823460
4 M             H          764506
5 V             M          383696
6 V             H          344422
7 A             M          247004

Question 14

ANSWER: The top 7 observations with the most cases when grouped by sex again correspond to Barcelona, Madrid, Valencia and Alicante, the same provinces as in the previous exercise.

Question 15

Use the {datapasta} package to import the population table of the provinces by copying from https://www.ine.es/jaxiT3/Datos.htm?t=2852. Incorporate this info into the table as you see fit.

Question 15

In the "Addins" menu, the datapasta package offers "Paste as tribble", which generated this code:

library(datapasta)
population <- tibble::tribble(
                               ~provincia, ~poblacion,
                                 "Total",  47385107,
                              "Albacete",    386464,
                      "Alacant/Alicante",   1881762,
                               "Almería",    731792,
                           "Araba/Álava",    333626,
                              "Asturias",   1011792,
                                 "Ávila",    158421,
                               "Badajoz",    669943,
                         "Balears Illes",   1173008,
                             "Barcelona",   5714730,
                               "Bizkaia",   1154334,
                                "Burgos",    356055,
                               "Cáceres",    389558,
                                 "Cádiz",   1245960,
                             "Cantabria",    584507,
                    "Castellón/Castelló",    587064,
                           "Ciudad Real",    492591,
                               "Córdoba",    776789,
                              "Coruña A",   1120134,
                                "Cuenca",    195516,
                              "Gipuzkoa",    726033,
                                "Girona",    786596,
                               "Granada",    921338,
                           "Guadalajara",    265588,
                                "Huelva",    525835,
                                "Huesca",    224264,
                                  "Jaén",    627190,
                                  "León",    451706,
                                "Lleida",    439727,
                                  "Lugo",    326013,
                                "Madrid",   6751251,
                                "Málaga",   1695651,
                                "Murcia",   1518486,
                               "Navarra",    661537,
                               "Ourense",    305223,
                              "Palencia",    159123,
                            "Palmas Las",   1128539,
                            "Pontevedra",    944275,
                              "Rioja La",    319796,
                             "Salamanca",    327338,
                "Santa Cruz de Tenerife",   1044405,
                               "Segovia",    153663,
                               "Sevilla",   1947852,
                                 "Soria",     88747,
                             "Tarragona",    822309,
                                "Teruel",    134545,
                                "Toledo",    709403,
                     "Valencia/València",   2589312,
                            "Valladolid",    519361,
                                "Zamora",    168725,
                              "Zaragoza",    967452,
                                 "Ceuta",     83517,
                               "Melilla",     86261
                )

Question 15

I was having trouble using the datapasta package directly, so I generated the code from the Addins menu, which produced the tibble::tribble() call above. Then I join it with the final_data dataset.

Question 15

Before joining the datasets we have to check whether the province names match

final_data_vals <- unique(final_data$provincia)
population_vals <- unique(population$provincia)

# Values of final_data that aren't in population: A Coruña, Las Palmas, NC, València/Valencia, Castelló/Castellón, La Rioja, Balears
final_data_vals[!final_data_vals %in% population_vals]
[1] "A Coruña"           "Castelló/Castellón" "Las Palmas"        
[4] "La Rioja"           "NC"                 "Balears"           
[7] "València/Valencia" 
# Values of population that aren't in final_data: Total, Castellón/Castelló, Palmas Las, Valencia/València, Balears Illes, Coruña A, Rioja La
population_vals[!population_vals %in% final_data_vals] 
[1] "Total"              "Balears Illes"      "Castellón/Castelló"
[4] "Coruña A"           "Palmas Las"         "Rioja La"          
[7] "Valencia/València" 
# Maps final_data names onto population names ("NC" has no population
# entry, so it maps to itself and stays unmatched)
equivalences <- c(
  "A Coruña" = "Coruña A",
  "Las Palmas" = "Palmas Las",
  "NC" = "NC",
  "València/Valencia" = "Valencia/València",
  "Castelló/Castellón" = "Castellón/Castelló",
  "La Rioja" = "Rioja La",
  "Balears" = "Balears Illes"
)
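
As an alternative to the ifelse() replacement used below, dplyr’s recode() can apply the same kind of named vector directly via `!!!` splicing. A minimal sketch with a toy subset of the mapping (the short vector here is illustrative, not the full equivalences):

```r
library(dplyr)

# Toy subset of the equivalences mapping (old name -> new name)
equiv_toy <- c("A Coruña" = "Coruña A", "Las Palmas" = "Palmas Las")
provincias <- c("A Coruña", "Madrid", "Las Palmas")

# !!! splices the named vector into recode(); unmatched values pass through
recode(provincias, !!!equiv_toy)
#> [1] "Coruña A"   "Madrid"     "Palmas Las"
```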

Question 15

We join both databases taking into account the non matching values

final_data <- final_data |>
  mutate(provincia = ifelse(provincia %in% names(equivalences), 
                            equivalences[provincia], 
                            provincia))

final_data <- final_data |>
  left_join(population, by = "provincia")

print(final_data, width = Inf)
# A tibble: 866,020 × 12
   provincia_iso fecha               age_group cases age_inf age_up sex  
   <chr>         <dttm>              <chr>     <dbl>   <dbl>  <dbl> <chr>
 1 A             2020-01-01 00:00:00 0-9          NA       0      9 H    
 2 A             2020-01-01 00:00:00 10-19        NA      10     19 H    
 3 A             2020-01-01 00:00:00 20-29        NA      20     29 H    
 4 A             2020-01-01 00:00:00 30-39        NA      30     39 H    
 5 A             2020-01-01 00:00:00 40-49        NA      40     49 H    
 6 A             2020-01-01 00:00:00 50-59        NA      50     59 H    
 7 A             2020-01-01 00:00:00 60-69        NA      60     69 H    
 8 A             2020-01-01 00:00:00 70-79        NA      70     79 H    
 9 A             2020-01-01 00:00:00 80+          NA      80     NA H    
10 A             2020-01-01 00:00:00 NC           NA      NA     NA H    
   month_year provincia        CCAA  cum_cases poblacion
   <chr>      <chr>            <chr>     <dbl>     <dbl>
 1 1-2020     Alacant/Alicante VC            0   1881762
 2 1-2020     Alacant/Alicante VC            0   1881762
 3 1-2020     Alacant/Alicante VC            0   1881762
 4 1-2020     Alacant/Alicante VC            0   1881762
 5 1-2020     Alacant/Alicante VC            0   1881762
 6 1-2020     Alacant/Alicante VC            0   1881762
 7 1-2020     Alacant/Alicante VC            0   1881762
 8 1-2020     Alacant/Alicante VC            0   1881762
 9 1-2020     Alacant/Alicante VC            0   1881762
10 1-2020     Alacant/Alicante VC            0   1881762
# ℹ 866,010 more rows

Question 15

ANSWER: I joined both datasets, taking into account that some province names do not coincide exactly, such as “Las Palmas” vs. “Palmas Las” or “València/Valencia” vs. “Valencia/València”. For these I created a vector of equivalences and applied it to one of the datasets, so that the join produces no missing values.
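
A join like this can be double-checked with dplyr’s anti_join(), which returns the rows that would fail to match. A sketch on toy data (these stand-ins and their values are hypothetical, not the real final_data/population tables):

```r
library(dplyr)

# Toy stand-ins for final_data and population (hypothetical values)
final_toy <- tibble(provincia = c("Coruña A", "Madrid", "Typo province"))
pop_toy   <- tibble(provincia = c("Coruña A", "Madrid"),
                    poblacion = c(1120134, 6751251))

# Rows of final_toy that would get NA poblacion after a left_join
anti_join(final_toy, pop_toy, by = "provincia")
# returns the single unmatched row "Typo province"
```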

Question 16

Define a function called cum_incidence() that, given as arguments an ordered vector (by date) of cases and another parameter \(k\), calculates the cumulative incidence at \(k\) days (understood as the number of cases in the last \(k\) days per 100000 habitants). Make use of this function and create a new variable representing the cumulative incidence, in each age group, sex and province. Then, determine the 5 provinces with the highest cumulative incidence in women over 80 years of age as of March 1, 2022.

Question 16

# Cumulative incidence: we define the function
cum_incidence <- function(cases, k, poblacion){
  cases <- ifelse(is.na(cases), 0, cases)  # treat missing counts as 0
  cases <- as.numeric(cases)
  poblacion <- as.numeric(poblacion)
  # rolling sum of the cases in the last k days (right-aligned window)
  roll_sum <- rollsum(cases, k, fill = NA, align = "right")
  incidence <- (roll_sum / poblacion) * 100000  # rate per 100,000 inhabitants
  return(incidence)
}

# Apply the function within each province, age group and sex
covid_incidence <- final_data |>
  group_by(provincia, age_group, sex) |>
  mutate(
    cum_incidence = cum_incidence(cases, k = 7, poblacion = poblacion)) |>
  ungroup()
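
To sanity-check the rolling-sum logic without depending on zoo, here is an equivalent base-R sketch (cum_incidence_base() is a hypothetical helper written for this check, not part of the assignment):

```r
# Same idea as cum_incidence(), but with cumsum() instead of zoo::rollsum()
cum_incidence_base <- function(cases, k, poblacion) {
  cases[is.na(cases)] <- 0
  cs <- cumsum(cases)
  roll <- cs - c(rep(0, k), head(cs, -k))  # cases in the last k days
  roll[seq_len(k - 1)] <- NA               # first k - 1 days are undefined
  (roll / poblacion) * 100000
}

# With 10 cases/day and population 100000, the 7-day incidence is 70
# from day 7 onwards (NA before that):
cum_incidence_base(rep(10, 10), k = 7, poblacion = 100000)
```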

Question 16

covid_incidence |>
  filter(age_group == "80+",
         sex == "M",  # M = mujer (women)
         fecha == as.Date("2022-03-01")) |>
  arrange(desc(cum_incidence)) |>
  slice_head(n = 5)
# A tibble: 5 × 13
  provincia_iso fecha               age_group cases age_inf age_up sex  
  <chr>         <dttm>              <chr>     <dbl>   <dbl>  <dbl> <chr>
1 LU            2022-03-01 00:00:00 80+          15      80     NA M    
2 ZA            2022-03-01 00:00:00 80+           2      80     NA M    
3 OR            2022-03-01 00:00:00 80+          12      80     NA M    
4 AV            2022-03-01 00:00:00 80+           2      80     NA M    
5 O             2022-03-01 00:00:00 80+          16      80     NA M    
# ℹ 6 more variables: month_year <chr>, provincia <chr>, CCAA <chr>,
#   cum_cases <dbl>, poblacion <dbl>, cum_incidence <dbl>

Question 16

ANSWER: We define the function cum_incidence(), which computes the cumulative incidence over the last \(k\) days by dividing the number of cases in that window by the population of each province and multiplying by 100,000 to obtain the rate per 100,000 inhabitants. When applying the function we use a standard 7-day window. We then filter for the profile of interest (women over 80 years of age as of March 1, 2022). The 5 provinces with the highest cumulative incidence in that group are Lugo, Zamora, Ourense, Ávila and Asturias.

Question 17

The last question does not exist and it is up to you to formulate and solve it. What question can you think of that could be interesting to ask with the current database? Why? What would be its resolution? You have absolute creative freedom to do so.

QUESTION: How has the incidence rate of COVID-19 evolved over time? What differences are there in the incidence of COVID-19 across the different CCAA in 2020? We also analyze the differences between sexes and age groups.

Question 17

COVID-19 OVER TIME

# First we compute the daily incidence rate and a month-start date for the x-axis
covid_data <- final_data |>
  mutate(
    tasa_incidencia = (cases / poblacion) * 100000,
    month_year_new = as.Date(paste0(year(fecha), "-", month(fecha), "-01"))) |>
  filter(!is.na(sex))

plot <- ggplot(covid_data, aes(x = month_year_new, y = tasa_incidencia, color = sex)) +
  geom_line() +
  labs(title = "Evolution of the incidence rate of COVID-19",
       x = "Month",
       y = "Incidence rate per 100,000 inhabitants") +
  theme_minimal() +
  scale_x_date(date_labels = "%b-%Y", date_breaks = "1 month") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
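
As a possible simplification, lubridate’s floor_date() performs the same month truncation as the paste0()/as.Date() construction above in one step. A minimal sketch:

```r
library(lubridate)

# Truncate any date to the first day of its month
floor_date(as.Date("2020-03-15"), unit = "month")
#> [1] "2020-03-01"
```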

Question 17

print (plot)

Question 17

ANSWER: Since March 2020 the incidence rate of COVID-19 generally increases over time, although there are several drops: June 2020, and March, April and October 2021. There are also spikes in cases in July-August 2021 and in December 2021-January 2022. We cannot say much about the differences by sex, since for many records the data for one of the sexes is missing, and sometimes the rate for one sex is very low compared to the other, which points to missing data rather than genuinely large differences between the sexes.

Question 17

COVID-19 BY CCAA

# In this case we test the differences between CCAA with a one-way ANOVA.
covid_data_2020 <- covid_data |> 
  filter(year(month_year_new) == 2020)

anova_result <- aov(tasa_incidencia ~ CCAA, data = covid_data_2020)
summary(anova_result)
                Df Sum Sq Mean Sq F value Pr(>F)    
CCAA            18  33375  1854.2   987.9 <2e-16 ***
Residuals   174108 326781     1.9                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
213833 observations deleted due to missingness

ANSWER: There are significant differences in the COVID-19 incidence rate between CCAA in the year 2020

Question 17

posthoc_result <- TukeyHSD(anova_result)
posthoc_result
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = tasa_incidencia ~ CCAA, data = covid_data_2020)

$CCAA
              diff         lwr          upr     p adj
AR-AN  0.610636031  0.55437502  0.666897040 0.0000000
AS-AN -0.174332544 -0.26384542 -0.084819666 0.0000000
CB-AN -0.086861013 -0.17511533  0.001393302 0.0597343
CE-AN  1.642582727  1.51026991  1.774895539 0.0000000
CL-AN  0.835123397  0.79271150  0.877535292 0.0000000
CM-AN  0.488666217  0.44114852  0.536183909 0.0000000
CN-AN -0.577161172 -0.64680332 -0.507519025 0.0000000
CT-AN  0.099308565  0.05192197  0.146695161 0.0000000
EX-AN  0.126474915  0.05908478  0.193865047 0.0000000
GA-AN -0.228642980 -0.28168323 -0.175602726 0.0000000
IB-AN -0.163739080 -0.24801419 -0.079463973 0.0000000
MC-AN  0.100592235  0.01554769  0.185636777 0.0046148
MD-AN  0.148335553  0.07575068  0.220920425 0.0000000
ML-AN  2.128180304  2.00722921  2.249131394 0.0000000
NA-AN  0.537110043  0.45684037  0.617379713 0.0000000
PV-AN  0.317531562  0.26423403  0.370829099 0.0000000
RI-AN  0.676623272  0.58917800  0.764068540 0.0000000
VC-AN -0.154938197 -0.20897702 -0.100899372 0.0000000
AS-AR -0.784968574 -0.88181211 -0.688125043 0.0000000
CB-AR -0.697497044 -0.79317849 -0.601815601 0.0000000
CE-AR  1.031946696  0.89456841  1.169324986 0.0000000
CL-AR  0.224487367  0.16823007  0.280744660 0.0000000
CM-AR -0.121969814 -0.18216986 -0.061769769 0.0000000
CN-AR -1.187797203 -1.26663972 -1.108954683 0.0000000
CT-AR -0.511327466 -0.57142409 -0.451230845 0.0000000
EX-AR -0.484161116 -0.56102167 -0.407300566 0.0000000
GA-AR -0.839279011 -0.90392716 -0.774630864 0.0000000
IB-AR -0.774375111 -0.86639906 -0.682351157 0.0000000
MC-AR -0.510043796 -0.60277291 -0.417314682 0.0000000
MD-AR -0.462300478 -0.54375401 -0.380846946 0.0000000
ML-AR  1.517544274  1.39107182  1.644016723 0.0000000
NA-AR -0.073525987 -0.16189642  0.014844450 0.2556200
PV-AR -0.293104469 -0.35796387 -0.228245069 0.0000000
RI-AR  0.065987241 -0.02894847  0.160922952 0.6010281
VC-AR -0.765574227 -0.83104414 -0.700104319 0.0000000
CB-AS  0.087471531 -0.03085933  0.205802391 0.4797426
CE-AS  1.816915270  1.66290213  1.970928414 0.0000000
CL-AS  1.009455941  0.91994540  1.098966483 0.0000000
CM-AS  0.662998761  0.57095910  0.755038423 0.0000000
CN-AS -0.402828628 -0.50801131 -0.297645945 0.0000000
CT-AS  0.273641109  0.18166906  0.365613158 0.0000000
EX-AS  0.300807458  0.19710211  0.404512802 0.0000000
GA-AS -0.054310437 -0.14931905  0.040698179 0.8830184
IB-AS  0.010593463 -0.10480005  0.125986973 1.0000000
MC-AS  0.274924779  0.15896814  0.390881419 0.0000000
MD-AS  0.322668097  0.21551432  0.429821874 0.0000000
ML-AS  2.302512848  2.15814341  2.446882285 0.0000000
NA-AS  0.711442587  0.59894108  0.823944090 0.0000000
PV-AS  0.491864106  0.39671162  0.587016593 0.0000000
RI-AS  0.850955815  0.73322713  0.968684500 0.0000000
VC-AS  0.019394347 -0.07617533  0.114964023 0.9999999
CE-CB  1.729443740  1.57615865  1.882728825 0.0000000
CL-CB  0.921984410  0.83373246  1.010236357 0.0000000
CM-CB  0.575527230  0.48471111  0.666343354 0.0000000
CN-CB -0.490300159 -0.59441387 -0.386186443 0.0000000
CT-CB  0.186169578  0.09542198  0.276917177 0.0000000
EX-CB  0.213335928  0.11071494  0.315956915 0.0000000
GA-CB -0.141781967 -0.23560577 -0.047958165 0.0000178
IB-CB -0.076878067 -0.19129804  0.037541909 0.6636593
MC-CB  0.187453248  0.07246537  0.302441123 0.0000017
MD-CB  0.235196566  0.12909190  0.341301235 0.0000000
ML-CB  2.215041317  2.07144883  2.358633808 0.0000000
NA-CB  0.623971056  0.51246833  0.735473778 0.0000000
PV-CB  0.404392575  0.31042309  0.498362063 0.0000000
RI-CB  0.763484285  0.64670966  0.880258906 0.0000000
VC-CB -0.068077184 -0.16246909  0.026314722 0.5289833
CL-CE -0.807459329 -0.93977056 -0.675148097 0.0000000
CM-CE -1.153916510 -1.28795167 -1.019881348 0.0000000
CN-CE -2.219743899 -2.36312284 -2.076364953 0.0000000
CT-CE -1.543274162 -1.67726290 -1.409285420 0.0000000
EX-CE -1.516107812 -1.65840652 -1.373809101 0.0000000
GA-CE -1.871225707 -2.00731671 -1.735134701 0.0000000
IB-CE -1.806321807 -1.95735090 -1.655292713 0.0000000
MC-CE -1.541990492 -1.69345028 -1.390530703 0.0000000
MD-CE -1.494247174 -1.63907831 -1.349416041 0.0000000
ML-CE  0.485597578  0.31142336  0.659771799 0.0000000
NA-CE -1.105472683 -1.25430384 -0.956641527 0.0000000
PV-CE -1.325051165 -1.46124265 -1.188859679 0.0000000
RI-CE -0.965959455 -1.11878016 -0.813138749 0.0000000
VC-CE -1.797520923 -1.93400421 -1.661037635 0.0000000
CM-CL -0.346457180 -0.39397047 -0.298943887 0.0000000
CN-CL -1.412284569 -1.48192371 -1.342645425 0.0000000
CT-CL -0.735814832 -0.78319702 -0.688432648 0.0000000
EX-CL -0.708648483 -0.77603551 -0.641261453 0.0000000
GA-CL -1.063766378 -1.11680269 -1.010730064 0.0000000
IB-CL -0.998862478 -1.08313510 -0.914589851 0.0000000
MC-CL -0.734531162 -0.81957325 -0.649489078 0.0000000
MD-CL -0.686787844 -0.75936984 -0.614205853 0.0000000
ML-CL  1.293056907  1.17210755  1.414006268 0.0000000
NA-CL -0.298013354 -0.37828042 -0.217746289 0.0000000
PV-CL -0.517591835 -0.57088545 -0.464298221 0.0000000
RI-CL -0.158500126 -0.24594300 -0.071057248 0.0000000
VC-CL -0.990061594 -1.04409655 -0.936026638 0.0000000
CN-CM -1.065827389 -1.13868871 -0.992966063 0.0000000
CT-CM -0.389357652 -0.44135990 -0.337355404 0.0000000
EX-CM -0.362191302 -0.43290321 -0.291479393 0.0000000
GA-CM -0.717309197 -0.77451071 -0.660107686 0.0000000
IB-CM -0.652405297 -0.73935953 -0.565451068 0.0000000
MC-CM -0.388073982 -0.47577414 -0.300373820 0.0000000
MD-CM -0.340330664 -0.41600964 -0.264651685 0.0000000
ML-CM  1.639514087  1.51668123  1.762346942 0.0000000
NA-CM  0.048443826 -0.03463423  0.131521886 0.8639143
PV-CM -0.171134655 -0.22857481 -0.113694498 0.0000000
RI-CM  0.187957055  0.09792695  0.277987156 0.0000000
VC-CM -0.643604414 -0.70173305 -0.585475773 0.0000000
CT-CN  0.676469737  0.60369384  0.749245634 0.0000000
EX-CN  0.703636087  0.61650244  0.790769729 0.0000000
GA-CN  0.348518192  0.27194071  0.425095671 0.0000000
IB-CN  0.413422092  0.31265932  0.514184860 0.0000000
MC-CN  0.677753407  0.57634623  0.779160586 0.0000000
MD-CN  0.725496725  0.63428595  0.816707499 0.0000000
ML-CN  2.705341476  2.57237529  2.838307667 0.0000000
NA-CN  1.114271215  1.01683374  1.211708690 0.0000000
PV-CN  0.894692734  0.81793683  0.971448640 0.0000000
RI-CN  1.253784444  1.15035564  1.357213244 0.0000000
VC-CN  0.422222975  0.34495049  0.499495456 0.0000000
EX-CT  0.027166350 -0.04345753  0.097790230 0.9980731
GA-CT -0.327951545 -0.38504420 -0.270858890 0.0000000
IB-CT -0.263047645 -0.34993030 -0.176164987 0.0000000
MC-CT  0.001283670 -0.08634553  0.088912871 1.0000000
MD-CT  0.049026988 -0.02656975  0.124623722 0.7238106
ML-CT  2.028871739  1.90608954  2.151653939 0.0000000
NA-CT  0.437801478  0.35479833  0.520804626 0.0000000
PV-CT  0.218222997  0.16089124  0.275554751 0.0000000
RI-CT  0.577314707  0.48735373  0.667275685 0.0000000
VC-CT -0.254246762 -0.31226829 -0.196225238 0.0000000
GA-EX -0.355117895 -0.42965320 -0.280582590 0.0000000
IB-EX -0.290213995 -0.38943363 -0.190994356 0.0000000
MC-EX -0.025882680 -0.12575669  0.073991327 0.9999928
MD-EX  0.021860638 -0.06764247  0.111363748 0.9999971
ML-EX  2.001705390  1.86990475  2.133506031 0.0000000
NA-EX  0.410635129  0.31479431  0.506475949 0.0000000
PV-EX  0.191056647  0.11633804  0.265775256 0.0000000
RI-EX  0.550148357  0.44822232  0.652074399 0.0000000
VC-EX -0.281413111 -0.35666228 -0.206163941 0.0000000
IB-GA  0.064903900 -0.02518703  0.154994829 0.5311649
MC-GA  0.329235215  0.23842412  0.420046315 0.0000000
MD-GA  0.376978533  0.29771539  0.456241673 0.0000000
ML-GA  2.356823284  2.23175031  2.481896255 0.0000000
NA-GA  0.765753024  0.67939735  0.852108693 0.0000000
PV-GA  0.546174542  0.48408824  0.608260846 0.0000000
RI-GA  0.905266252  0.81220307  0.998329438 0.0000000
VC-GA  0.073704784  0.01098097  0.136428594 0.0051632
MC-IB  0.264331315  0.15236846  0.376294171 0.0000000
MD-IB  0.312074633  0.20925600  0.414893262 0.0000000
ML-IB  2.291919385  2.15073768  2.433101087 0.0000000
NA-IB  0.700849124  0.59246865  0.809229596 0.0000000
PV-IB  0.481270642  0.39102800  0.571513284 0.0000000
RI-IB  0.840362352  0.72656524  0.954159460 0.0000000
VC-IB  0.008800884 -0.08188154  0.099483305 1.0000000
MD-MC  0.047743318 -0.05570692  0.151193552 0.9838513
ML-MC  2.027588069  1.88594573  2.169230412 0.0000000
NA-MC  0.436517808  0.32753796  0.545497656 0.0000000
PV-MC  0.216939327  0.12597772  0.307900938 0.0000000
RI-MC  0.576031037  0.46166294  0.690399136 0.0000000
VC-MC -0.255530432 -0.34692836 -0.164132500 0.0000000
ML-MD  1.979844751  1.84531393  2.114375576 0.0000000
NA-MD  0.388774490  0.28921247  0.488336514 0.0000000
PV-MD  0.169196009  0.08976047  0.248631543 0.0000000
RI-MD  0.528287719  0.42285503  0.633720406 0.0000000
VC-MD -0.303273750 -0.38320854 -0.223338956 0.0000000
NA-ML -1.591070261 -1.72989821 -1.452242313 0.0000000
PV-ML -1.810648742 -1.93583104 -1.685466449 0.0000000
RI-ML -1.451557032 -1.59465369 -1.308460372 0.0000000
VC-ML -2.283118501 -2.40861820 -2.157618804 0.0000000
PV-NA -0.219578481 -0.30609241 -0.133064550 0.0000000
RI-NA  0.139513229  0.02864976  0.250376694 0.0014704
VC-NA -0.692048240 -0.77902081 -0.605075672 0.0000000
RI-PV  0.359091710  0.26588165  0.452301771 0.0000000
VC-PV -0.472469759 -0.53541128 -0.409528237 0.0000000
VC-RI -0.831561468 -0.92519737 -0.737925564 0.0000000

Question 17

ANSWER: In general the differences in the COVID-19 incidence rate are significant between almost all pairs of Autonomous Communities. Reading the adjusted p-values above (note the ISO codes: CB = Cantabria, CL = Castilla y León), the few non-significant pairs are: Cantabria - Andalucía, Cantabria - Asturias, Galicia - Asturias, Balearic Islands - Asturias, Valencia - Asturias, Balearic Islands - Cantabria, Valencia - Cantabria, Navarra - Castilla-La Mancha, Extremadura - Catalonia, Murcia - Catalonia, Madrid - Catalonia, Murcia - Extremadura, Madrid - Extremadura, Balearic Islands - Galicia, Valencia - Balearic Islands, Madrid - Murcia, Navarra - Aragón and La Rioja - Aragón.

Question 17

COVID-19 BY AGE AND SEX

# We are going to do a regression. We prepare our data

#First we clean the data: drop the NC categories and rows with missing values
covid_data_clean <- covid_data |>
  filter(sex != "NC", age_group != "NC") |>
  drop_na(tasa_incidencia, sex, age_group) 

#Convert to factors (H = hombre/men is the reference level for sex)
covid_data_clean <- covid_data_clean |>
  mutate(
    sex = factor(sex, levels = c("H", "M")),
    age_group = factor(age_group)       
  )
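
If a different baseline for the regression were wanted, relevel() changes the reference category of a factor. A sketch on a toy factor (the choice of "40-49" here is arbitrary):

```r
# relevel() moves the chosen level to the front, making it the
# reference category in lm()/aov() contrasts
ag <- factor(c("0-9", "40-49", "80+"))
levels(relevel(ag, ref = "40-49"))
#> [1] "40-49" "0-9"   "80+"
```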

Question 17

# Linear regression
modelo_regresion <- lm(tasa_incidencia ~ age_group + sex, data = covid_data_clean)
summary(modelo_regresion)

Call:
lm(formula = tasa_incidencia ~ age_group + sex, data = covid_data_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
 -3.597  -2.054  -1.110   0.010 106.208 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     2.12320    0.02122  100.05   <2e-16 ***
age_group10-19  0.76159    0.02822   26.99   <2e-16 ***
age_group20-29  0.64002    0.02778   23.04   <2e-16 ***
age_group30-39  0.79239    0.02760   28.71   <2e-16 ***
age_group40-49  1.27170    0.02743   46.35   <2e-16 ***
age_group50-59  0.43759    0.02755   15.88   <2e-16 ***
age_group60-69 -0.53398    0.02792  -19.13   <2e-16 ***
age_group70-79 -1.02664    0.02862  -35.87   <2e-16 ***
age_group80+   -0.88369    0.02890  -30.58   <2e-16 ***
sexM            0.21650    0.01295   16.72   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.693 on 525408 degrees of freedom
Multiple R-squared:  0.0261,    Adjusted R-squared:  0.02608 
F-statistic:  1565 on 9 and 525408 DF,  p-value: < 2.2e-16

Question 17

ANSWER: First, the model is significant, with a p-value below the significance level (<0.05), although it has a low R-squared: sex and age only explain about 2.6% of the variance in the COVID-19 incidence rate. Even so, both variables, and all their categories, are significant in the model.

With respect to sex, being female (M, mujer) is associated with an increase of 0.2165 units in the COVID-19 incidence rate compared to being male (H, hombre), holding the other variables constant. (As noted before, this may partly reflect missing values, some of them among males, which would bias the comparison.)

Question 17

Regarding the age groups, the reference category is 0-9 years. The groups with positive coefficients (10 to 59 years) have higher COVID-19 incidence rates than the youngest group (0-9 years), while those aged 60 and over have lower incidence rates than the youngest. This may be explained by the strong protection of the elderly during the COVID-19 months.